Methods, systems, and storage media for automatically identifying relevant chemical compounds in patent documents

ABSTRACT

Methods, systems, and non-transitory media for training a chemical entity recognition system to extract chemical compounds from a patent document and determine a relevance of the chemical compounds to the patent document are disclosed. A method includes obtaining patent documents from patent databases, normalizing each patent document into a unified format, and generating a chemical patent corpus. The chemical patent corpus includes chemical entities, each having relevancy annotations that indicate a relevance to the patent document from which the chemical entity is extracted. The method further includes providing the chemical patent corpus to the chemical entity recognition system, which tags the one or more chemical entities in a corresponding normalized patent document, extracts additional chemical entities, assigns a confidence score to each additional chemical entity, and labels each additional chemical entity as relevant or irrelevant to an associated patent document based on information contained in the chemical patent corpus.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims priority to U.S. Provisional Patent Application No. 62/639,656, filed Mar. 7, 2018 and entitled “AUTOMATIC IDENTIFICATION OF RELEVANT CHEMICAL COMPOUNDS FROM PATENTS,” the contents of which are incorporated herein by reference in their entirety.

TECHNICAL FIELD

The present disclosure relates to methods, systems, and storage media for automatically identifying chemical compounds in patent documents, and more specifically, for training a chemical entity recognition system to automatically extract chemical compounds from patent documents and classify the chemical compounds' relevance with respect to the corresponding patent documents.

BACKGROUND ART

Chemistry-related publications may include patent applications and scientific journal articles. In commercial research and development projects, an initial public disclosure of new chemical compounds may take place in patent applications. On occasion, it may take an additional 1 to 3 years for these chemical compounds to appear in journal publications. Therefore, these chemical compounds may only be available through patent documents for a period of time. Additionally, chemical patent documents may contain unique information such as reactions, experimental conditions, mode of action, bioactivity data, and catalysts. Analyzing such information may be necessary because it allows an understanding of compound prior art, provides a means for novelty checking and validation, and points to starting points for chemical research in academia and industry.

SUMMARY

One aspect of the present disclosure relates to a method of training a chemical entity recognition system to extract one or more chemical compounds from a patent document and determine a relevance of the one or more chemical compounds to the patent document. The method includes obtaining, by a processing device, a plurality of patent documents from one or more patent databases. The method further includes normalizing, by the processing device, each patent document of the plurality of patent documents into a unified format to achieve a plurality of unified patent documents. The method further includes generating, by the processing device, a chemical patent corpus from the plurality of unified patent documents. The chemical patent corpus includes one or more chemical entities extracted from the plurality of unified patent documents. Each of the one or more chemical entities includes one or more relevancy annotations. The one or more relevancy annotations indicate a relevance to the patent document from which the chemical entity is extracted. The method further includes providing, by the processing device, the chemical patent corpus to the chemical entity recognition system. The chemical entity recognition system, in response to receiving the chemical patent corpus, tags the one or more chemical entities in a corresponding normalized patent document of the plurality of unified patent documents, extracts one or more additional chemical entities from the plurality of unified patent documents, assigns a confidence score to each of the one or more additional chemical entities, and labels each of the one or more additional chemical entities as relevant or irrelevant to an associated patent document based on information contained in the chemical patent corpus.

Another aspect of the present disclosure relates to a system configured for training a chemical entity recognition system to extract one or more chemical compounds from a patent document and determine a relevance of the one or more chemical compounds to the patent document. The system includes one or more hardware processors and a non-transitory, processor-readable storage medium comprising one or more programming instructions thereon. The programming instructions, when executed, cause the one or more hardware processors to obtain a plurality of patent documents from one or more patent databases. The programming instructions, when executed, cause the one or more hardware processors to normalize each patent document of the plurality of patent documents into a unified format to achieve a plurality of unified patent documents. The programming instructions, when executed, cause the one or more hardware processors to generate a chemical patent corpus from the plurality of unified patent documents. The chemical patent corpus includes one or more chemical entities extracted from the plurality of unified patent documents. Each of the one or more chemical entities includes one or more relevancy annotations. The one or more relevancy annotations indicate a relevance to the patent document from which the chemical entity is extracted. The programming instructions, when executed, cause the one or more hardware processors to provide the chemical patent corpus to the chemical entity recognition system. The chemical entity recognition system tags the one or more chemical entities in a corresponding normalized patent document of the plurality of unified patent documents, extracts one or more additional chemical entities from the plurality of unified patent documents, assigns a confidence score to each of the one or more additional chemical entities, and labels each of the one or more additional chemical entities as relevant or irrelevant to an associated patent document based on information contained in the chemical patent corpus.

Yet another aspect of the present disclosure relates to a non-transitory storage medium having executable instructions embodied thereon for causing a processing device to obtain a plurality of patent documents from one or more patent databases, normalize each patent document of the plurality of patent documents into a unified format to achieve a plurality of unified patent documents, and generate a chemical patent corpus from the plurality of unified patent documents. The chemical patent corpus includes one or more chemical entities extracted from the plurality of unified patent documents. Each of the one or more chemical entities includes one or more relevancy annotations. The one or more relevancy annotations indicate a relevance to the patent document from which the chemical entity is extracted. The executable instructions further cause the processing device to provide the chemical patent corpus to the chemical entity recognition system. The chemical entity recognition system tags the one or more chemical entities in a corresponding normalized patent document of the plurality of unified patent documents, extracts one or more additional chemical entities from the plurality of unified patent documents, assigns a confidence score to each of the one or more additional chemical entities, and labels each of the one or more additional chemical entities as relevant or irrelevant to an associated patent document based on information contained in the chemical patent corpus.

These and other features and characteristics of the present technology, as well as the methods of operation and functions of the related elements of structure and the combination of parts and economies of manufacture, will become more apparent upon consideration of the following description and the appended claims with reference to the accompanying drawings, all of which form a part of this specification, wherein like reference numerals designate corresponding parts in the various figures. It is to be expressly understood, however, that the drawings are for the purpose of illustration and description only and are not intended as a definition of the limits of the invention. As used in the specification and in the claims, the singular forms of ‘a’, ‘an’, and ‘the’ include plural referents unless the context clearly dictates otherwise.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 schematically depicts an illustrative network for a system for training a chemical entity recognition system to automatically extract chemical compounds from patent documents and determine a relevance of the chemical compounds to the patent documents according to one or more embodiments shown and described herein;

FIG. 2A depicts a block diagram of illustrative internal components of a training device configured to train a chemical entity recognition system to automatically extract chemical compounds from patent documents and determine a relevance of the chemical compounds to the patent documents according to one or more embodiments shown and described herein;

FIG. 2B depicts a block diagram of illustrative logic modules located within a memory of a training device that is configured to train a chemical entity recognition system to automatically extract chemical compounds from patent documents and determine a relevance of the chemical compounds to the patent documents according to one or more embodiments shown and described herein;

FIG. 2C depicts a block diagram of illustrative data components located within a storage device of a training device that is configured to train a chemical entity recognition system to automatically extract chemical compounds from patent documents and determine a relevance of the chemical compounds to the patent documents according to one or more embodiments shown and described herein;

FIG. 3A depicts a block diagram of illustrative internal components of a chemical entity recognition system that is trained to automatically extract chemical compounds from patent documents and determine a relevance of the chemical compounds to the patent documents according to one or more embodiments shown and described herein;

FIG. 3B depicts a block diagram of illustrative logic modules located within a memory of a chemical entity recognition system that is trained to automatically extract chemical compounds from patent documents and determine a relevance of the chemical compounds to the patent documents according to one or more embodiments shown and described herein;

FIG. 3C depicts a block diagram of illustrative data components located within a storage device of a chemical entity recognition system that is trained to automatically extract chemical compounds from patent documents and determine a relevance of the chemical compounds to the patent documents according to one or more embodiments shown and described herein;

FIG. 4 depicts a flow diagram of an illustrative general method of training a chemical entity recognition system to automatically extract chemical compounds from patent documents and determine a relevance of the chemical compounds to the patent documents according to one or more embodiments shown and described herein;

FIG. 5 depicts a flow diagram of an illustrative method of classifying relevancy according to one or more embodiments shown and described herein;

FIG. 6 depicts a flow diagram of an illustrative method of developing a patent corpus according to one or more embodiments shown and described herein;

FIG. 7 depicts an illustrative user interface depicting annotations in a patent document using an annotation tool according to one or more embodiments shown and described herein;

FIG. 8 depicts a chart indicating a performance of a chemical entity recognition system based on precision, recall, and F-score according to one or more embodiments shown and described herein; and

FIG. 9 depicts a chart indicating a performance of a relevancy classification system as a function of a relevance-score threshold when a relevancy feature is removed according to one or more embodiments shown and described herein.

DESCRIPTION OF EMBODIMENTS

The present disclosure generally relates to a system that automatically extracts chemical compounds from a patent document and determines the chemical compounds' relevance to that patent document. The processes described herein relate to a training device that is particularly configured to pull patent documents from a database, normalize the patent documents, and feed the patent documents to a machine learning system (referred to herein as a chemical entity recognition system) such that the machine learning system, once trained, can automatically recognize chemical compounds within the normalized patent documents and determine whether the chemical compounds are relevant or irrelevant to the associated patent documents.
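
By way of non-limiting illustration only, the output of such a system may be pictured as a list of extracted chemical entities, each carrying a confidence score and a relevancy label. The following Python sketch is not the disclosed implementation. The names `ChemicalEntity`, `label_relevance`, and `relevant_only`, the separate `relevance_score` field, the 0-to-1 score range, and the fixed threshold of 0.5 are all assumptions introduced for illustration; in the disclosure, the label is determined by the trained system based on information contained in the chemical patent corpus.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass(frozen=True)
class ChemicalEntity:
    """One chemical entity extracted from a patent document (hypothetical record)."""
    text: str                        # compound name as it appears in the document
    patent_id: str                   # document from which the entity was extracted
    confidence: float                # extraction confidence (assumed range 0..1)
    relevance_score: float           # assumed output of a relevancy classifier (0..1)
    relevant: Optional[bool] = None  # None until a relevancy label is assigned


def label_relevance(entities: List[ChemicalEntity],
                    threshold: float = 0.5) -> List[ChemicalEntity]:
    """Return copies labeled relevant when relevance_score >= threshold.

    The threshold rule is a stand-in for the trained classifier.
    """
    return [replace(e, relevant=(e.relevance_score >= threshold)) for e in entities]


def relevant_only(entities: List[ChemicalEntity]) -> List[ChemicalEntity]:
    """Keep only the entities labeled relevant, e.g., before cataloging them."""
    return [e for e in entities if e.relevant]
```

Under these assumptions, calling `relevant_only(label_relevance(entities))` retains only the entities labeled relevant to their associated patent documents.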

Patent data contained within patent documents can be obtained from various patent databases, including, but not limited to, databases maintained by various patent offices such as the European Patent Office (EPO), the United States Patent and Trademark Office (USPTO), the World Intellectual Property Organization (WIPO), the Japan Patent Office (JPO), the State Intellectual Property Office (SIPO) of China, and the African Regional Intellectual Property Organization (ARIPO). In some embodiments, patent databases may be maintained by non-governmental entities, such as, for example, Google. The information contained within databases maintained by non-governmental entities may be a copy of information contained in various patent office databases. Accordingly, the term “patent database” as used herein generally refers to any database that contains patent documents or patent data, including (but not limited to) the databases noted hereinabove.

Depending on the patent authority, the data that is made available may be in one or more formats, including, but not limited to, XML, HTML, text PDF, Optical Character Recognition (OCR) PDF, image PDF, and the like. Patent documents may follow a systematic structure consisting of title, bibliographic information (e.g., patent number, dates, inventor name(s), assignee(s), applicant(s), and International Patent Classification (IPC) classes), abstract, description, and claims. In some embodiments, the chemical data contained within a patent document may be available in an experimental section of the description, while chemical compounds that are claimed (i.e., protected by the patent) may be available in the claim section. Drawings, sequences, or other additional information containing chemical data may be found at the very end of the patent document (e.g., after a claims listing and an abstract).
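
As a non-limiting illustration of the systematic structure described above, a patent document may be held in memory as a simple record. The following Python sketch is an assumption for illustration only; the class names `Bibliographic` and `PatentRecord`, the field names, and the helper `chemical_text_sources` are hypothetical and do not appear in any patent office data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Bibliographic:
    """Bibliographic information of a patent document (hypothetical field names)."""
    patent_number: str
    dates: Dict[str, str] = field(default_factory=dict)  # e.g., {"filed": "..."}
    inventors: List[str] = field(default_factory=list)
    assignees: List[str] = field(default_factory=list)
    applicants: List[str] = field(default_factory=list)
    ipc_classes: List[str] = field(default_factory=list)


@dataclass
class PatentRecord:
    """Title, bibliographic information, abstract, description, and claims."""
    title: str
    bibliographic: Bibliographic
    abstract: str
    description: str
    claims: List[str]


def chemical_text_sources(doc: PatentRecord) -> Dict[str, str]:
    """Return the sections in which chemical data is typically found.

    Experimental data is usually in the description; claimed compounds
    are in the claims.
    """
    return {"description": doc.description, "claims": "\n".join(doc.claims)}
```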

While patent authorities make the patent documents available, they do not provide systematic continuous chemical annotations and full-text searching capabilities, so manual or automatic excerption processes may be considered. Manual excerption processes are costly and time consuming, and may therefore be limited to commercial content providers, such as, for example, Elsevier Reaxys (Elsevier B.V., Amsterdam NL). Automatic approaches to extract information from patents may extract images and attachment files, but the extracted information may only be derived by text mining and image mining, and may only be available for certain patent documents published after a certain date (e.g., information from digital chemical structure files provided by the USPTO for a subset of its patents (granted patents from 2001 until 2011)). However, it proves difficult to maintain public databases, and many of the automatic approaches have thus become outdated. Furthermore, such automatic approaches have limitations in the interpretation of individual drawing features (such as chemical bonds) found in the structure diagrams of some images. Further, automatic approaches that utilize text mining focus on the recognition of chemical compounds in patents, which is limited by the compounds contained in a dictionary. Addition of all systematic compound identifiers to a dictionary is impossible, as they are algorithmically generated based on the structure of a compound and a set of rules. Furthermore, correctness of the chemical structure associated with a recognized compound is essential in the field of chemistry. Often, a combination of the methods above in the form of an ensemble system is used for chemical compound recognition, which requires a gold-standard corpus for training, developing, and testing performance. Producing such a corpus is laborious and expensive. It involves development of well-defined annotation guidelines, selection and training of domain experts for annotation, selection of the data, annotation of the data by multiple annotators, and harmonization of the annotations.

Extracting information from patents automatically is fast but has limitations. The majority of patent text-mining systems have been developed, trained, and tested using the title and abstract of the patent documents. Therefore, their usage is not evaluated on full-text documents. More importantly, automatic extraction is mostly focused on extraction of all chemical compounds mentioned. In manually excerpted databases, the focus is on relevant compounds. A compound is relevant to a patent when it plays a major role within the patent application (e.g., a starting material or a product in a reaction specified in the claim section). Relevant compounds are a small fraction of all the compounds mentioned within the patent document. Automatic identification of the relevant compounds would greatly reduce the amount of data extracted from patents and can improve the usefulness of patent resources. Furthermore, these compounds can be used in predictive analyses to identify the key compounds within the patent (key compounds are the main compounds protected by the patent application and are usually well-hidden within the context).

Accordingly, the systems, methods, and media of the present disclosure identify relevant chemical compounds in patent documents using an automatic approach that determines whether a chemical entity is relevant or irrelevant to the patent document in which it is contained. This approach minimizes the size of the database that is maintained to catalog the ever-increasing number of patent documents available, which allows the database to be searched more efficiently, allows searching to return more relevant results, and makes the database less costly to maintain. Other advantages may also be realized.

As used herein, the term “patent document” generally refers to any patent related publication, including, but not limited to, published patents (including utility patents, design patents, and plant patents), published patent applications, published utility models, published innovation patents, published utility certificates, published petty patents, published short term patents, published utility innovations, published functional designs, and the like. In some embodiments, a patent document may be a chemistry related patent document containing chemical information therein. That is, a chemistry related patent document may include, but is not limited to, one or more chemical symbols, one or more functional groups, an identification of one or more chemical classes, an identification of one or more chemical formulas, an identification of one or more chemical structural formulae, an identification of one or more chemical prefixes, an identification of one or more chemical suffixes, an identification of one or more chemical properties, any chemical nomenclature and/or terminology promulgated by the International Union of Pure and Applied Chemistry (IUPAC), and/or the like.

Referring now to the figures, FIG. 1 schematically depicts an illustrative network for a system for training a chemical entity recognition system to automatically extract chemical compounds from patent documents and determine a relevance of the chemical compounds to the patent documents according to various embodiments. As illustrated in FIG. 1, a computer network 100 may include a wide area network (WAN), such as the Internet, a local area network (LAN), a mobile communications network, a public switched telephone network (PSTN), a personal area network (PAN), a metropolitan area network (MAN), a virtual private network (VPN), and/or another network. The computer network 100 may generally be configured to electronically connect one or more computing devices and/or components thereof. Illustrative computing devices may include, but are not limited to, a training device 110, a chemical entity recognition system 120, one or more data repositories 130, and/or a user computing device 140.

The training device 110 may generally be configured to train the chemical entity recognition system 120 and may further be configured to transmit and/or receive electronic data and/or the like from one or more sources (e.g., the chemical entity recognition system 120, the one or more data repositories 130, and/or the user computing device 140), direct operation of one or more other devices (e.g., the chemical entity recognition system 120, the one or more data repositories 130, and/or the user computing device 140), collect data from one or more sources (e.g., patent document data, particularly chemical patent document data from the one or more data repositories 130 or the like), store data relating to chemical entities located within patent documents, associated patent documents, data pertaining to relevance of a chemical entity in a patent document, and/or the like. Additional details regarding the training device 110 are described herein. In some embodiments, the training device 110 may be able to communicate with one or more other devices according to a client/server architecture and/or other architectures.

The chemical entity recognition system 120 is generally a machine learning (ML) server that is particularly configured to receive data pertaining to chemical patent documents, analyze the data and extract chemical entities therefrom, and determine whether the extracted chemical entities are relevant to the chemical patent documents from which they were extracted. The chemical entity recognition system 120 may continuously receive data and/or instructions from one or more other devices of the computer network 100, including, but not limited to, the training device 110, the one or more data repositories 130, and/or the user computing device 140. Additional details regarding the chemical entity recognition system 120 are described herein.

The one or more data repositories 130 may generally store data that is used for the purposes of extracting chemical entities and determining relevance thereof, as described herein. That is, the one or more data repositories 130 may contain patent documents, particularly chemical patent documents. In some embodiments, the one or more data repositories 130 may be third party servers that contain information that can be used for the purposes of providing a dynamically ranked recommendation list, and that are accessible via an application programming interface (API) or the like by the training device 110, the chemical entity recognition system 120, and/or the user computing device 140. For example, the one or more data repositories 130 may include one or more repositories maintained by a patent office, such as, for example, the USPTO, the EPO, the SIPO, the JPO, WIPO, and ARIPO. In some embodiments, data may be directly obtained from the one or more data repositories 130 automatically and continuously for the purposes of carrying out the processes described herein. In other embodiments, data may be copied from the one or more data repositories 130 to the training device 110 and/or the chemical entity recognition system 120 for the purposes of carrying out the processes described herein.

The user computing device 140 may generally be used as an interface between a user and the other components connected to the computer network 100, and/or various other components communicatively coupled to the user computing device 140 (such as components communicatively coupled via one or more networks to the user computing device 140), whether or not specifically described herein. Thus, the user computing device 140 may be used to perform one or more user-facing functions, such as receiving one or more inputs from a user or providing information to the user. For example, the user computing device 140 may receive user inputs that correspond to researching patent documents (including chemical patent documents), researching chemical information, researching chemical entities, providing information, conducting various searches, and/or the like. Additionally, in the event that the training device 110 and/or the chemical entity recognition system 120 require oversight, updating, or correction, the user computing device 140 may be configured to provide the desired oversight, updating, and/or correction. The user computing device 140 may also be used to input additional data into a data storage portion of the training device 110, the chemical entity recognition system 120, and/or the one or more data repositories 130. For example, a user may use the user computing device 140 to upload a patent publication to one or more components connected via the computer network 100. In some embodiments, the user computing device 140 may be configured to communicate with other platforms via a server and/or according to a peer-to-peer architecture and/or other architectures.

It should be understood that while the user computing device 140 is depicted as a personal computer and the training device 110, the chemical entity recognition system 120, and the one or more data repositories 130 are depicted as servers, these are nonlimiting examples. More specifically, in some embodiments, any type of computing device (e.g., mobile computing device, personal computer, server, etc.) or any specialized device that has computing components may be used for any of these components. Additionally, while each of the devices is illustrated in FIG. 1 as a single piece of hardware, this is also merely an example. More specifically, each of the training device 110, the chemical entity recognition system 120, the one or more data repositories 130, and the user computing device 140 may represent a plurality of computers, servers, databases, mobile devices, components, specialized devices, and/or the like. Similarly, the one or more data repositories 130 may be a single computer, server, database, mobile device, component, specialized device, and/or the like.

Illustrative hardware components of the training device 110 are depicted in FIG. 2A. A bus 200 may interconnect the various components, which include (but are not limited to) a processing device 210, user interface hardware 220, communications interface hardware 230, memory 240, and/or a storage device 260. The processing device 210, such as a computer processing unit (CPU), may be the central processing unit of the training device 110, performing calculations and logic operations required to execute a program. The processing device 210, alone or in conjunction with one or more of the other elements disclosed in FIG. 2A, is an illustrative processing device, computing device, processor, or combination thereof, as such terms are used within this disclosure. The memory 240, such as read only memory (ROM) and random access memory (RAM), may constitute an illustrative memory device (i.e., a non-transitory processor-readable storage medium). Such memory 240 may include one or more programming instructions thereon that, when executed by the processing device 210, cause the processing device 210 to complete various processes, such as the processes described herein. In some embodiments, the program instructions may be stored on a tangible computer-readable medium such as a compact disc, a digital disk, flash memory, a memory card, a USB drive, an optical disc storage medium, such as a Blu-ray™ disc, and/or other non-transitory processor-readable storage media.

In some embodiments, the program instructions contained on the memory 240 may be embodied as a plurality of software logic modules, where each logic module provides programming instructions for completing one or more tasks. For example, certain software logic modules may be used for the purposes of collecting information (e.g., information contained within patent documents, particularly chemical patent documents), extracting information (e.g., chemical entities from chemical patent documents), providing information (e.g., transmitting information to the chemical entity recognition system 120 (FIG. 1)), and/or the like. Additional details regarding the logic modules will be discussed herein with respect to FIG. 2B.

Still referring to FIG. 2A, the storage device 260, which may generally be a storage medium that is separate from the memory 240, may contain one or more data repositories for storing data pertaining to patent documents, particularly chemical patent documents, data pertaining to chemical entities, data pertaining to whether a chemical entity is relevant to an associated patent document, data that is transmitted to the chemical entity recognition system 120 (FIG. 1) for the purposes of training the chemical entity recognition system 120, data pertaining to annotations, and/or the like. Still referring to FIG. 2A, the storage device 260 may be any physical storage medium, including, but not limited to, a hard disk drive (HDD), memory, removable storage, and/or the like. While the storage device 260 is depicted as a local device, it should be understood that the storage device 260 may be a remote storage device, such as, for example, a server computing device, the one or more data repositories 130 (FIG. 1), or the like. Additional details regarding the types of data stored within the storage device 260 are described with respect to FIG. 2C.

Still referring to FIG. 2A, the user interface hardware 220 may permit information from the bus 200 to be provided to a user, whether the user is local to the training device 110 or remote from the training device 110 (e.g., a user of the user computing device 140 (FIG. 1)). Still referring to FIG. 2A, the user interface hardware 220 may incorporate a display and/or one or more input devices such that information is displayed on the display in audio, visual, graphic, or alphanumeric format and/or inputs are received. Illustrative input devices include, but are not limited to, a keyboard, a mouse, a joystick, a touch screen, a remote control, a pointing device, a video input device, an audio input device, a haptic feedback device, and/or the like.

Referring to FIGS. 1 and 2A, the communications interface hardware 230 may generally provide the training device 110 with an ability to interface with one or more components of the computer network 100. For example, the training device 110 may communicate with components of the computer network 100 via the communications interface hardware 230, including, but not limited to, the chemical entity recognition system 120, the one or more data repositories 130, and/or the user computing device 140. Communication with external devices may occur using various communication ports (not shown). An illustrative communication port may be attached to a communications network, such as the Internet, an intranet, a local network, a direct connection, and/or the like.

It should be understood that the components illustrated in FIG. 2A are merely illustrative and are not intended to limit the scope of this disclosure. More specifically, while the components in FIG. 2A are illustrated as residing within the training device 110, this is a nonlimiting example. In some embodiments, one or more of the components may reside external to the training device 110, either within one or more of the components described with respect to FIG. 1, other components, or as standalone components. Similarly, one or more of the components may be embodied in other computing devices not specifically described herein. In addition, while the components in FIG. 2A relate particularly to the training device 110, this is also a nonlimiting example. That is, similar components may be located within other components without departing from the scope of the present disclosure.

Referring now to FIG. 2B, illustrative logic modules that may be contained within the memory 240 of the training device 110 (FIG. 2A) are depicted. The logic modules may include, but are not limited to, patent document obtaining logic 242, patent document normalization logic 244, patent corpus generating logic 246, patent corpus providing logic 248, scoring logic 250, and/or communications logic 252.

The patent document obtaining logic 242 generally contains programming instructions for obtaining patent documents. That is, the patent document obtaining logic 242 may include programming for causing the processing device 210 (FIG. 2A) to access one or more data storage components (e.g., the storage device 260 (FIG. 2A), the one or more data repositories 130 (FIG. 1), and/or the like) and obtain patent documents, particularly chemical patent documents, therefrom. As such, the patent document obtaining logic 242 may include programming instructions that allow for a connection between devices to be established, protocol for requesting data stores containing data, instructions for causing the data to be copied, moved, or read, and/or the like. Accordingly, as a result of operating according to the patent document obtaining logic 242, data and information pertaining to patent documents, particularly chemical patent documents, is available for completing various other processes, as described in greater detail herein.

The patent document normalization logic 244 generally contains programming instructions for normalizing patent documents that have been obtained from a plurality of sources. That is, the patent document normalization logic 244 contains programming instructions that cause information from patent documents, particularly chemical patent documents, to be written in a unified format for later access, thereby resulting in a plurality of unified patent documents. Such a unified format should be generally understood to be a format that is common to all of the patent documents, similar to a unidiff that is commonly used in computing data comparison. Thus, the plurality of unified patent documents refers to a plurality of patent documents that have been modified to comply with the unified format. By way of non-limiting example, normalizing each patent document may include converting the plurality of patent documents into a unified XML representation format, utilizing one or more predefined XML tags corresponding to heuristic information within the plurality of patent documents. It should be understood that predefined XML tags generally refer to custom tags that define particular portions of a patent document that may be called different things in different countries or even from patent to patent in the same database, so that any object or section tagged with the custom tag will be read according to the custom tag. For example, a particular body of text may be referred to as a “detailed description” in one patent document, a “detailed disclosure of the embodiments” in another patent document, and a “disclosure” in a third patent document. The predefined XML tags may be set so that all three of these bodies of text are recognized as being the same thing when read later on, as described herein.
As used herein, the term “heuristic information” refers to a statistical value associated with a particular portion of a patent document that represents the relative suitability of the portion among its peers based on intuition, previous experience, common sense, and/or the like, which may be developed, for example, based on machine learning.
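
By way of non-limiting illustration, the use of predefined XML tags to unify differently named sections may be sketched as follows. This is a minimal sketch under stated assumptions and is not the disclosed implementation: the tag names, the `SECTION_TAGS` mapping, the `source-heading` attribute, and the `to_unified_xml` function are hypothetical names introduced only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from source-specific headings to one unified tag.
SECTION_TAGS = {
    "detailed description": "description",
    "detailed disclosure of the embodiments": "description",
    "disclosure": "description",
    "abstract": "abstract",
    "claims": "claims",
}


def to_unified_xml(sections):
    """Build a unified XML tree from (heading, text) pairs."""
    root = ET.Element("patent")
    for heading, text in sections:
        # Headings that are not in the mapping fall back to a generic tag.
        tag = SECTION_TAGS.get(heading.strip().lower(), "other")
        node = ET.SubElement(root, tag)
        # Keep the original heading so the source wording is not lost.
        node.set("source-heading", heading)
        node.text = text
    return root


doc = to_unified_xml([
    ("Detailed Description", "Compound A was prepared."),
    ("DISCLOSURE", "Compound B was prepared."),
])
print(ET.tostring(doc, encoding="unicode"))
```

In this sketch, both input sections are emitted under the same `description` tag, so later processing can treat them as the same portion of a patent document.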

The patent corpus generating logic 246 generally contains programming instructions for generating a corpus from the normalized documents that are produced as a result of operating according to the patent document normalization logic 244. That is, the generated normalized documents are collected into a corpus according to the patent corpus generating logic 246. In some embodiments, the corpus is further stored in a data repository in accordance with the programming instructions provided by the patent corpus generating logic 246. In still further embodiments, the corpus data may be stored separately from the data containing the patent documents and/or the data containing the normalized documents.

In some embodiments, the patent corpus generating logic 246 may further contain programming instructions for generating a chemical patent corpus from the plurality of unified/normalized patent documents. A chemical patent corpus is generally a corpus of unified/normalized documents (or data extracted from documents that have been unified/normalized) that contain one or more chemical entities therein. In some embodiments, all of the unified/normalized documents may have chemical entities therein, and thus all may be included within the chemical patent corpus. Generating the chemical patent corpus may include, for example, identifying a chemical compound within text contained in each patent document of the plurality of normalized/unified patent documents. Generating the chemical patent corpus may also include accessing a physical properties database and obtaining one or more physical properties of the identified chemical compound. It should be understood that a physical properties database is generally a database that contains data matching particular compounds to particular physical properties. For example, the compound H₂O may be contained within the physical properties database along with corresponding data relating to the physical properties of water. Generating the chemical patent corpus may also include generating a chemical structure corresponding to the chemical compound based on the one or more physical properties. Identifying the chemical compound may include utilizing a dictionary-based approach and/or a morphology-based approach to identify the chemical compound.

The morphology-based approach may include identifying one or more elements within the chemical compound and combining the one or more elements to create the chemical compound if the chemical compound is validated based on a structural chemistry of the chemical compound. By way of non-limiting example, generating the chemical patent corpus from the plurality of normalized/unified patent documents may include annotating each of the plurality of unified patent documents with one or more of a chemical compound, a compound class, a suffix of a chemical compound, and a prefix of a chemical compound.
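
By way of non-limiting illustration, a dictionary-based approach to identifying chemical compounds in text may be sketched as follows. This is a simplified sketch under stated assumptions: the `COMPOUND_DICTIONARY` contents and the `find_compounds` function are hypothetical, and the disclosure does not specify how the dictionary is built or matched.

```python
import re

# Hypothetical dictionary of known compound names mapped to formulas.
COMPOUND_DICTIONARY = {
    "water": "H2O",
    "sodium chloride": "NaCl",
    "magnesium oxide": "MgO",
}


def find_compounds(text):
    """Return sorted (start, end, name, formula) tuples for dictionary hits."""
    hits = []
    for name, formula in COMPOUND_DICTIONARY.items():
        # Match whole words only, ignoring letter case.
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.end(), name, formula))
    return sorted(hits)


for hit in find_compounds("Sodium chloride was dissolved in water."):
    print(hit)
```

The character offsets returned with each hit allow the identified compound to be annotated in place within the normalized patent document.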

It should be understood that a chemical compound is a chemical substance composed of chemical elements held together by chemical bonds, including molecules (or molecular entities) held together by chemical bonds. Chemical compounds may be molecules held together by covalent bonds, ionic compounds held together by ionic bonds, intermetallic compounds held together by metallic bonds, or complexes held together by coordinate covalent bonds. Chemical compounds may be expressed by a chemical formula. By way of non-limiting example, the chemical compound may be selected from a mono-component compound, a compound mixture part, or a prophetic compound. A mono-component compound may include pure chemical compounds such as, for example, systematic identifiers, trivial names, elements, and chemical formulas. A compound mixture part may be a portion of a compound that has a particular percentage of components (e.g., ‘Magnesiaflux’, which scientifically is a mixture of 30% MgF₂ and 70% MgO). A prophetic compound is a specific compound that is uncharacterized within the text of a patent document and is mentioned in a claims portion of a patent document or a description portion of a patent document only for intellectual property protection.

A compound class can generally be any grouping of compounds based on particular criteria. For example, chemical compounds may be classified according to the elements present in a compound (e.g., an oxide compound class may contain any chemical compound having one or more oxygen atoms, a hydride compound class may contain any chemical compound having one or more hydrogen atoms, a halide compound class may contain any chemical compound having one or more halogen atoms, and an organic compound class may contain any chemical compound having a backbone of carbon atoms). In another example, chemical compounds may be classified according to the type of bonds that a compound contains (e.g., an ionic compound class contains compounds that are formed by attractive forces between oppositely charged ions, such as salts, and a molecular compound class contains compounds that are formed with covalent bonds). In yet another example, chemical compounds may be classified according to reactivity of a particular compound (e.g., an acid compound class contains compounds that produce hydrogen ions (protons or H⁺ ions) when dissolved in water, and a base compound class contains compounds that receive hydrogen ions when formed). A suffix of a chemical compound refers to the ending of the name of the chemical compound. By way of non-limiting example, the compound class may be selected from a chemical class, a biomolecule, a polymer, a mixture class, a mixture part class, or a Markush class. It should be understood that biomolecules are generally molecules and ions that are present in organisms, such as, but not limited to, proteins, carbohydrates, lipids, nucleic acids, metabolites, and/or the like. It should also be understood that a polymer is generally a substance that has a molecular structure consisting chiefly or entirely of a large number of similar units bonded together, such as, for example, synthetic organic materials used as plastics and resins.
It should also be understood that a mixture class is a general class of mixture of materials, such as, for example, a solution, a suspension, a colloid, or the like. Similarly, a mixture part class refers to a class of parts that make up a mixture (e.g., compounds that make up a portion of a mixture). A Markush class generally refers to a class of compounds that are accepted as being in the same Markush group, such as compounds that have a single structural similarity, a common use, or the like.

In some embodiments, the patent corpus generating logic 246 may contain programming instructions for grouping one or more chemical entities extracted from the plurality of normalized/unified patent documents into a particular corpus. It should be understood that the term “chemical entity” generally refers to a physical entity of interest in chemistry, which includes, but is not limited to, molecular entities, parts thereof, and chemical substances. Each of the one or more chemical entities may include one or more relevancy annotations. As described in greater detail herein, a relevancy annotation is a generated annotation as to whether a particular chemical entity is relevant to the patent document from which it was extracted. The one or more relevancy annotations may include an indication of a relevant compound for a prophetic compound or a Markush class. By way of non-limiting example, the one or more relevancy annotations may include an indication of an irrelevant compound for a compound mixture part, a mixture part class, a mixture class, a polymer, or a biomolecule. The one or more relevancy annotations for a mono-component compound or a chemical class may be assigned based on a context of the corresponding unified patent document. The one or more relevancy annotations may indicate a relevance to the patent document from which the chemical entity is extracted.
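
By way of non-limiting illustration, the assignment of relevancy annotations by entity type described above may be sketched as a simple lookup. The set names, the `default_relevancy` function, and the `"context-dependent"` label are hypothetical names introduced only for this example. The sketch leaves unresolved how context is evaluated for mono-component compounds and chemical classes.

```python
# Entity types taken from the description; the grouping names are illustrative.
RELEVANT_TYPES = {"prophetic compound", "markush class"}
IRRELEVANT_TYPES = {
    "compound mixture part",
    "mixture part class",
    "mixture class",
    "polymer",
    "biomolecule",
}
CONTEXT_DEPENDENT_TYPES = {"mono-component compound", "chemical class"}


def default_relevancy(entity_type):
    """Return a relevancy annotation for an entity type."""
    key = entity_type.strip().lower()
    if key in RELEVANT_TYPES:
        return "relevant"
    if key in IRRELEVANT_TYPES:
        return "irrelevant"
    if key in CONTEXT_DEPENDENT_TYPES:
        # Must be decided from the surrounding document context.
        return "context-dependent"
    raise ValueError(f"unknown entity type: {entity_type!r}")


for entity_type in ("Markush class", "polymer", "chemical class"):
    print(entity_type, "->", default_relevancy(entity_type))
```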

Referring to FIGS. 1 and 2B, the patent corpus providing logic 248 generally contains programming instructions for providing the patent corpus to another device in the computer network 100. For example, the patent corpus providing logic 248 may contain programming instructions that allow data pertaining to the patent corpus to be transmitted to the chemical entity recognition system 120, the one or more data repositories 130, and/or the user computing device 140.

The scoring logic 250 generally contains programming instructions for scoring each chemical entity contained within a patent corpus. That is, the scoring logic 250 contains programming instructions for assigning a relevance score, a confidence score, and/or the like to each chemical entity within the patent corpus in response to a score received from the chemical entity recognition system 120, as described in greater detail herein.

The communications logic 252 generally contains programming instructions for communicating with one or more of the devices in the computer network 100. For example, the communications logic 252 may contain communications protocol(s) for establishing a communications connection with the chemical entity recognition system 120, the one or more data repositories 130, and/or the user computing device 140 such that data and/or signals can be transmitted therebetween.

The logic modules depicted with respect to FIG. 2B are merely illustrative. As such, it should be understood that additional or fewer logic modules may also be included within the memory 240 without departing from the scope of the present disclosure. In addition, certain logic modules may be combined into a single logic module and/or certain logic modules may be divided into separate logic modules in some embodiments.

Referring now to FIG. 2C, illustrative types of data that may be contained within the storage device 260 are depicted. The types of data may include, but are not limited to, patent document data 262, unified patent document data 264, patent corpus data 266, chemical entity data 268, and/or annotation data 270.

The patent document data 262 is generally data pertaining to patent documents, particularly chemical patent documents. In some embodiments, the data contained within the patent document data 262 may include full text documents received from one or more patent databases, such as the patent databases described herein.

The unified patent document data 264 is generally data pertaining to the unified patent documents that have been normalized as described herein. In some embodiments, the data contained within the unified patent document data 264 may include full text documents having annotations, an associated XML file, and/or the like that provides normalization information, as described in greater detail herein.

The patent corpus data 266 is generally the data that is generated as a result of creating a patent corpus, as described herein. In some embodiments, the patent corpus data 266 may be chemical patent corpus data.

The chemical entity data 268 may include data pertaining to one or more chemical entities extracted from the plurality of unified patent documents. That is, the chemical entity data 268 may identify each of the chemical entities located within each patent document of the patent corpus, may provide an associated structure, associated relevant names, associated categories, and/or the like.

The annotation data 270 generally includes data pertaining to annotations that are made with respect to the various chemical entities and/or patent documents within the patent corpus. For example, in some embodiments, each of the chemical entities may include one or more relevancy annotations that indicate a relevance to the patent document from which the chemical entity is extracted.

Illustrative hardware components of the chemical entity recognition system 120 are depicted in FIG. 3A. A bus 300 may interconnect the various components, which include (but are not limited to) a processing device 310, user interface hardware 320, communications interface hardware 330, memory 340, and/or a storage device 360. The processing device 310, such as a computer processing unit (CPU), may be the central processing unit of the chemical entity recognition system 120, performing calculations and logic operations required to execute a program. The processing device 310, alone or in conjunction with one or more of the other elements disclosed in FIG. 3A, is an illustrative processing device, computing device, processor, or combination thereof, as such terms are used within this disclosure. The memory 340, such as read only memory (ROM) and random access memory (RAM), may constitute an illustrative memory device (i.e., a non-transitory processor-readable storage medium). Such memory 340 may include one or more programming instructions thereon that, when executed by the processing device 310, cause the processing device 310 to complete various processes, such as the processes described herein. In some embodiments, the program instructions may be stored on a tangible computer-readable medium such as a compact disc, a digital disk, flash memory, a memory card, a USB drive, an optical disc storage medium, such as a Blu-ray™ disc, and/or other non-transitory processor-readable storage media.

In some embodiments, the program instructions contained on the memory 340 may be embodied as a plurality of software logic modules, where each logic module provides programming instructions for completing one or more tasks. For example, certain software logic modules may be used for the purposes of collecting information (e.g., information contained within patent documents, particularly chemical patent documents), extracting information (e.g., chemical entities from chemical patent documents), providing information (e.g., transmitting information to the training device 110 (FIG. 1)), learning what particular types of information mean, and/or the like. Additional details regarding the logic modules will be discussed herein with respect to FIG. 3B.

Still referring to FIG. 3A, the storage device 360, which may generally be a storage medium that is separate from the memory 340, may contain one or more data repositories for storing data pertaining to patent documents, particularly chemical patent documents, data pertaining to chemical entities, data pertaining to whether a chemical entity is relevant to an associated patent document, data that is transmitted to the training device 110 (FIG. 1), data pertaining to annotations, data pertaining to a confidence score, and/or the like. The storage device 360 may be any physical storage medium, including, but not limited to, a hard disk drive (HDD), memory, removable storage, and/or the like. While the storage device 360 is depicted as a local device, it should be understood that the storage device 360 may be a remote storage device, such as, for example, a server computing device, the one or more data repositories 130 (FIG. 1), or the like. Additional details regarding the types of data stored within the storage device 360 are described with respect to FIG. 3C.

Still referring to FIG. 3A, the user interface hardware 320 may permit information from the bus 300 to be provided to a user, whether the user is local to the chemical entity recognition system 120 or remote from the chemical entity recognition system 120 (e.g., a user of the user computing device 140 (FIG. 1)). The user interface hardware 320 may incorporate a display and/or one or more input devices such that information is displayed on the display in audio, visual, graphic, or alphanumeric format and/or inputs are received. Illustrative input devices include, but are not limited to, a keyboard, a mouse, a joystick, a touch screen, a remote control, a pointing device, a video input device, an audio input device, a haptic feedback device, and/or the like.

Referring to FIGS. 1 and 3A, the communications interface hardware 330 may generally provide the chemical entity recognition system 120 with an ability to interface with one or more components of the computer network 100. For example, the chemical entity recognition system 120 may communicate with components of the computer network 100 via the communications interface hardware 330, including, but not limited to, the training device 110, the one or more data repositories 130, and/or the user computing device 140. Communication with external devices may occur using various communication ports (not shown). An illustrative communication port may be attached to a communications network, such as the Internet, an intranet, a local network, a direct connection, and/or the like.

It should be understood that the components illustrated in FIG. 3A are merely illustrative and are not intended to limit the scope of this disclosure. More specifically, while the components in FIG. 3A are illustrated as residing within the chemical entity recognition system 120, this is a nonlimiting example. In some embodiments, one or more of the components may reside external to the chemical entity recognition system 120, either within one or more of the components described with respect to FIG. 1, other components, or as standalone components. Similarly, one or more of the components may be embodied in other computing devices not specifically described herein. In addition, while the components in FIG. 3A relate particularly to the chemical entity recognition system 120, this is also a nonlimiting example. That is, similar components may be located within other components without departing from the scope of the present disclosure.

Referring now to FIG. 3B, illustrative logic modules that may be contained within the memory 340 of the chemical entity recognition system 120 (FIG. 3A) are depicted. The logic modules may generally be modules of a machine learning logic 341 module. Illustrative logic modules include, but are not limited to, chemical entity extraction logic 342, chemical entity tagging logic 344, confidence score assigning logic 346, labeling logic 348, and/or relevancy scoring logic 350.

The machine learning logic 341 may generally be a logic module that incorporates one or more machine learning algorithms therein. The machine learning algorithms contained within the machine learning logic 341 and utilized by the chemical entity recognition system 120 (FIG. 3A) are not limited by the present disclosure, and may generally be any algorithm now known or later developed, particularly those that are specifically adapted for generating a predictive model that can be used for determining a relevancy of a particular chemical entity to an associated chemical patent document. That is, the machine learning algorithms may be supervised learning algorithms, unsupervised learning algorithms, semi-supervised learning algorithms, and/or reinforcement learning algorithms. Specific examples of machine learning algorithms may include, but are not limited to, nearest neighbor algorithms, naïve Bayes algorithms, decision tree algorithms, linear regression algorithms, support vector machines, neural networks, clustering algorithms, association rule learning algorithms, Q-learning algorithms, temporal difference algorithms, and deep adversarial networks. Other specific examples of machine learning algorithms within the machine learning logic 341 should generally be understood and are included within the scope of the present disclosure.

A predictive model that is generated as a result of operation of the machine learning logic 341 may generally be any machine learning model now known or later developed, particularly one that provides resulting information that can be used to determine a relevance of a chemical entity to an associated chemical patent document. Illustrative examples of machine learning models include, but are not limited to, a convolutional neural network (CNN) model, a long short-term memory (LSTM) model, a neural network (NN) model, a dynamic time warping (DTW) model, or the like.

The chemical entity extraction logic 342 contained within the machine learning logic 341 generally contains programming instructions for extracting chemical entities from a chemical patent document. That is, the chemical entity extraction logic 342 may contain programming instructions for receiving a normalized/unified patent document from the corpus of patent documents, analyzing the document, and determining chemical entities contained within the document, as described in greater detail herein.

The chemical entity tagging logic 344 contained within the machine learning logic 341 may generally contain programming instructions for tagging, annotating, or otherwise marking normalized/unified patent documents with data pertaining to chemical entities extracted therefrom, as described in greater detail herein.

The confidence score assigning logic 346 contained within the machine learning logic 341 generally contains programming instructions for assigning a confidence score to each of the one or more chemical entities. The confidence score generally represents a level of confidence pertaining to whether a chemical entity is relevant or irrelevant to a particular document based on various factors, as described in greater detail herein.

The labeling logic 348 contained within the machine learning logic 341 generally contains programming instructions for labeling, marking, or otherwise indicating additional chemical entities within a patent document that may not have been indicated by the training device 110 (FIG. 1), as described in greater detail herein.

Still referring to FIG. 3B, the relevancy scoring logic 350 contained within the machine learning logic 341 generally contains programming instructions for determining a relevancy of a chemical entity to the document from which it was extracted, as described in greater detail herein.

The logic modules depicted with respect to FIG. 3B are merely illustrative. As such, it should be understood that additional or fewer logic modules may also be included within the memory 340 without departing from the scope of the present disclosure. In addition, certain logic modules may be combined into a single logic module and/or certain logic modules may be divided into separate logic modules in some embodiments.

Referring now to FIG. 3C, illustrative types of data that may be contained within the storage device 360 are depicted. The types of data may include, but are not limited to, patent corpus data 362, chemical entity data 364, confidence score data 366, and/or relevance data 368.

The patent corpus data 362 is generally the data that is generated as a result of creating a patent corpus, as described herein. In some embodiments, the patent corpus data 362 may be chemical patent corpus data.

The chemical entity data 364 may include data pertaining to one or more chemical entities extracted from the plurality of unified patent documents, particularly additional entities extracted by the chemical entity recognition system 120 (FIG. 3A). That is, the chemical entity data 364 may be data that identifies each of the chemical entities located within each patent document of the patent corpus, may provide an associated structure, associated relevant names, associated categories, and/or the like.

The confidence score data 366 generally includes data pertaining to confidence scores determined by the chemical entity recognition system 120 (FIG. 3A). That is, the confidence score data 366 includes data that relates to a determined confidence that a chemical entity is relevant or irrelevant to the patent document from which it was extracted, as described in greater detail herein.

The relevance data 368 generally includes data that indicates a relevance of each chemical entity to a patent document from which the chemical entity was extracted. For example, the relevance data 368 may be a table or other similar data form that lists each of the chemical entities extracted in a particular patent document along with an associated indicator of relevance, as described in greater detail herein.
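
By way of non-limiting illustration, a table form of the relevance data may be sketched as follows. The record fields, the tab-separated layout, and the example identifiers and scores are assumptions made only for this sketch. The disclosure does not prescribe a particular schema.

```python
from dataclasses import dataclass


@dataclass
class RelevanceRecord:
    """One row of a hypothetical relevance table."""
    patent_id: str
    entity: str
    confidence: float
    relevant: bool


def relevance_table(records):
    """Render records as tab-separated text with a header row."""
    lines = ["patent_id\tentity\tconfidence\trelevance"]
    for record in records:
        label = "relevant" if record.relevant else "irrelevant"
        lines.append(
            f"{record.patent_id}\t{record.entity}\t"
            f"{record.confidence:.2f}\t{label}"
        )
    return "\n".join(lines)


# Illustrative values only; these are not data from the disclosure.
print(relevance_table([
    RelevanceRecord("EXAMPLE-0001", "aspirin", 0.934, True),
    RelevanceRecord("EXAMPLE-0001", "water", 0.881, False),
]))
```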

FIG. 4 depicts a block diagram of an illustrative method 400 of training a chemical entity recognition system to automatically extract chemical compounds from patent documents and determine a relevance of the chemical compounds to the patent documents in accordance with one or more implementations. The operations of method 400 presented below are intended to be illustrative. In some implementations, the method 400 may be accomplished with one or more additional operations not described, and/or without one or more of the operations discussed. Additionally, the order in which the operations of method 400 are illustrated in FIG. 4 and described below is not intended to be limiting.

In some implementations, the method 400 may be implemented by one or more processing devices (e.g., a digital processor, an analog processor, a digital circuit designed to process information, an analog circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information), such as the processing device 210 depicted and described herein with respect to FIG. 2A and/or the processing device 310 depicted and described herein with respect to FIG. 3A. Still referring to FIG. 4, the one or more processing devices may include one or more devices executing some or all of the operations of method 400 in response to instructions stored electronically on an electronic storage medium (e.g., the memory 240 depicted and described with respect to FIGS. 2A-2B and/or the memory 340 depicted and described with respect to FIGS. 3A-3B). The one or more processing devices may include one or more devices configured through hardware, firmware, and/or software to be specifically designed for execution of one or more of the operations of the method 400.

Referring to FIGS. 1-4, at block 402, a plurality of patent documents may be obtained. In some embodiments, the plurality of patent documents may be obtained from one or more patent databases, such as, for example, the one or more data repositories 130. Operation according to block 402 may be performed by one or more hardware processors configured by machine-readable instructions including logic that is the same as or similar to the patent document obtaining logic 242, in accordance with one or more implementations.

At block 404, each patent document of the plurality of patent documents may be normalized into a unified format to achieve a plurality of unified patent documents. Operation according to block 404 may be performed by one or more hardware processors configured by machine-readable instructions including logic that is the same as or similar to the patent document normalization logic 244, in accordance with one or more implementations.

At block 406, a one-to-one mapping between each character in the original text of each patent document and the corresponding character in the normalized patent document may be stored. Operation according to block 406 may be performed by one or more hardware processors configured by machine-readable instructions including logic that is the same as or similar to the patent document normalization logic 244 and/or the scoring logic 250, in accordance with one or more implementations.

At block 408, a chemical patent corpus may be generated. In some embodiments, the chemical patent corpus may be generated from the plurality of unified patent documents. The chemical patent corpus may include one or more chemical entities extracted from the plurality of unified patent documents. Each of the one or more chemical entities may include one or more relevancy annotations. The one or more relevancy annotations may indicate a relevance to the patent document from which the chemical entity is extracted. Operation according to block 408 may be performed by one or more hardware processors configured by machine-readable instructions including logic that is the same as or similar to the patent corpus generating logic 246, in accordance with one or more implementations.

At block 410, the chemical patent corpus may be provided to the chemical entity recognition system 120. Accordingly, the chemical entity recognition system 120 may tag the one or more chemical entities in a corresponding normalized patent document of the plurality of unified patent documents, extract one or more additional chemical entities from the plurality of unified patent documents, assign a confidence score to each of the one or more additional chemical entities, and label each of the one or more additional chemical entities as relevant or irrelevant to an associated patent document based on information contained in the chemical patent corpus, as described in greater detail herein. Operation according to block 410 may be performed by one or more hardware processors configured by machine-readable instructions including logic that is the same as or similar to the patent corpus providing logic 248, in accordance with one or more implementations.

Referring now to FIG. 5, an illustrative method of classifying relevancy is depicted. The chemical patents are pulled through patent offices at block 510. The patent source documents are normalized into a unified format at block 520. They are then fed into the chemical entity recognition system 530 that consists of two different named-entity extraction systems, Chemical Entity Recognizer (CER) 532 (Elsevier, Frankfurt DE) and a mining program 534 such as, for example, OCMiner (OntoChem, Halle DE). CER 532 extracts chemical entities and tags them in the normalized input document. OCMiner 534 further enriches the output of CER 532 by extracting additional chemical entities and assigning confidence scores to all extracted entities of both systems. The associated structures of chemical compounds extracted by CER 532 or OCMiner 534 are generated, validated, and standardized using a name service 536, such as, for example, the Reaxys Name Service (Elsevier, B.V., Amsterdam NL). The chemical annotations 542 in the patent corpus 540 are used to train and test the chemical entity recognition system 530. The relevancy annotations 544 in the corpus are used to train and test the relevancy classifier 550, which labels the chemical entities extracted by the chemical entity recognition system as relevant or irrelevant at block 560. Each of the components will now be described in greater detail.

Normalization

It may be necessary to normalize the variety of input sources and file formats into a unified text representation. The normalization step is performed by converting all input files (e.g. XML, HTML and PDF) into a unified XML representation format. Predefined XML tags corresponding to heuristic information such as document sections (title, abstract, claims, description and metadata) are used within this unified representation. The normalization also converts all character encodings into a particular format, such as, for example, UTF-8 (8-bit Unicode Transformation Format).
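As an illustration of the unified representation described above, the following minimal sketch writes the five document sections into a single XML document encoded as UTF-8. The root tag `patent` and the use of the section names as tag names are assumptions made for this example only; the actual predefined tags are not specified in the present disclosure.

```python
import xml.etree.ElementTree as ET

# Section names taken from the description; using them as tag names is an assumption.
SECTIONS = ("title", "abstract", "claims", "description", "metadata")


def to_unified_xml(sections: dict) -> bytes:
    """Serialize already-extracted section text into one UTF-8 encoded XML document."""
    root = ET.Element("patent")  # hypothetical root tag
    for name in SECTIONS:
        element = ET.SubElement(root, name)
        element.text = sections.get(name, "")
    return ET.tostring(root, encoding="utf-8")


if __name__ == "__main__":
    xml_bytes = to_unified_xml({"title": "Stabilized zirconia", "claims": "1. A compound ..."})
    print(xml_bytes.decode("utf-8"))
```

Converting the source formats (XML, HTML, PDF) into section text is outside the scope of this sketch.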

During normalization, a one-to-one mapping may be stored between each character in the original text and the corresponding character in the normalized document. This may provide a possibility to go back to the original document from the normalized text and vice versa. This may also minimize efforts to update the annotations in the patent corpus in case of changes in normalization methodology (note that the documents in the corpus have also been normalized).
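The character mapping can be sketched as follows. This is a minimal example, assuming two illustrative normalization rules (collapsing whitespace runs and Unicode NFKC folding) that are not taken from the present disclosure. The function names `normalize_with_mapping` and `to_original_span` are hypothetical.

```python
import unicodedata


def normalize_with_mapping(original: str):
    """Return (normalized_text, mapping), where mapping[i] is the index in
    `original` of the character that produced normalized_text[i]."""
    out_chars = []
    mapping = []
    prev_space = False
    for i, ch in enumerate(original):
        # Illustrative rule 1: collapse any run of whitespace into one space.
        if ch.isspace():
            if not prev_space:
                out_chars.append(" ")
                mapping.append(i)
            prev_space = True
            continue
        prev_space = False
        # Illustrative rule 2: NFKC folding, applied per character. One source
        # character may expand to several; each points back to the same index.
        for norm_ch in unicodedata.normalize("NFKC", ch):
            out_chars.append(norm_ch)
            mapping.append(i)
    return "".join(out_chars), mapping


def to_original_span(mapping, start, end):
    """Map a half-open span [start, end) of the normalized text to the original."""
    return mapping[start], mapping[end - 1] + 1


if __name__ == "__main__":
    text = "MgF\u2082  and\tMgO"
    normalized, mapping = normalize_with_mapping(text)
    start, end = to_original_span(mapping, 0, 4)
    print(repr(normalized), repr(text[start:end]))
```

Because every normalized character keeps a pointer to its source position, an annotation made on the normalized text can be projected back onto the original document without re-annotating.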

Patent Corpus Development

The development of the chemical patent corpus with chemical entity and relevancy annotations may be completed in two phases. FIG. 6 depicts an illustrative corpus creation process 600. A first phase 610 focuses on building a corpus with chemical entity annotations. The second phase 630 may include using the corpus obtained from the first phase 610 to assign relevancy annotations to the entities annotated in the first phase 610. In the second phase 630, annotators may also flag any compounds with spelling mistakes. For each phase, a set of well-defined guidelines may be developed that help achieve annotation consistency.

Chemical Entity Annotation Guideline

The chemical entity annotation guideline according to blocks 610 and 612 may be developed based on patent corpus development guidelines, such as the guidelines mentioned in “Annotated chemical patent corpus: a gold standard for text mining” authored by Akhondi, S. A., Klenner, A. G., Tyrchan, C. et al. (2014) and published in PLoS One, 9, e107477 and incorporated herein by reference in its entirety. The guidelines define the entities to be annotated. For each entity, positive and negative examples were provided. Additionally, any exception was defined and illustrated through examples. The guideline also defined how the annotation should be performed within the brat rapid annotation tool (available at http://brat.nlplab.org/). The brat tool allows online annotation of text using pre-defined entity types. Annotators were asked to annotate chemical compounds (e.g. tetrahydrofuran), chemical classes (e.g. zirconium alkoxide) and suffixes or prefixes of these compounds (e.g. ‘stabilized’ as prefix in ‘stabilized zirconia’ and ‘nanoparticles’ as suffix in ‘silver nanoparticles’).

Chemical compounds could be annotated in three categories: mono-component compound (pure chemical compounds, e.g. systematic identifiers, trivial names, elements, and chemical formulas), compound mixture part (e.g. ‘Magnesiaflux’, which scientifically is a mixture of 30% MgF₂ and 70% MgO) or prophetic compound (specific compounds that are uncharacterized within the text and are mentioned in claims or descriptions only for intellectual property protection).

Compound classes could be annotated in six categories: chemical class (natural products or substructure names, e.g. heterocycle), biomolecules (e.g. insulin), polymers (e.g. polyethylene), mixture classes (e.g. opium), mixture part classes (e.g. quinupristin) or Markush (textual description of a Markush formula, e.g. H_(a)X_(b)C—C—H).

Relevancy Annotation Guideline

For the relevancy annotation according to block 630, a new set of guidelines was developed, which defined how relevant compounds should be identified. The legal status of a compound (e.g. prophetic or claimed) and its characterization (e.g. NMR or MS measurement), properties (e.g. superconductivity), effects (e.g. toxicity) and transformation (e.g. reaction) were taken into consideration for defining the guidelines. The relevancy annotation did not include suffixes and prefixes of compounds. In brief, relevancy is assigned as follows: Prophetic compounds and Markush classes are relevant. Compound mixture parts, mixture part classes, mixture classes, polymers, and biomolecules are irrelevant. Mono-component compounds and chemical classes are assigned relevance based on the context of the full patent text. They are considered relevant to the patent if (a) the entity is present in the title or abstract section of the patent, (b) the entity is part of a reaction context (e.g. product, intermediate product, catalyst or starting material used in synthetic procedures) or (c) the entity or its measured property belongs to the invention in the claim section and is connected to the core invention of the patent document. The mono-component compounds and chemical classes are irrelevant if (a) the entity is only introduced for further explanation and is described beyond the invention, (b) the entity is described for reference or comparison or (c) the entity is involved in a chemical reaction but is not a starting material, product or catalyst.
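The category rules summarized above can be written as a small decision table. The sketch below is a simplification for illustration only: the guideline was applied by human annotators reading the full patent text, and the boolean flags used here (`in_title_or_abstract`, `in_reaction_role`, `in_core_claim`) are hypothetical stand-ins for those judgments. Absence of all three relevance conditions is treated as irrelevant, which is an assumption of this sketch.

```python
ALWAYS_RELEVANT = {"prophetic compound", "markush"}
ALWAYS_IRRELEVANT = {
    "compound mixture part",
    "mixture part class",
    "mixture class",
    "polymer",
    "biomolecule",
}
CONTEXT_DEPENDENT = {"mono-component compound", "chemical class"}


def guideline_relevance(category, in_title_or_abstract=False,
                        in_reaction_role=False, in_core_claim=False):
    """Return 'relevant' or 'irrelevant' following the category rules in brief."""
    if category in ALWAYS_RELEVANT:
        return "relevant"
    if category in ALWAYS_IRRELEVANT:
        return "irrelevant"
    if category in CONTEXT_DEPENDENT:
        # Conditions (a), (b) and (c) for relevance; any one of them suffices.
        if in_title_or_abstract or in_reaction_role or in_core_claim:
            return "relevant"
        return "irrelevant"
    raise ValueError(f"unknown category: {category}")


if __name__ == "__main__":
    print(guideline_relevance("polymer"))
    print(guideline_relevance("mono-component compound", in_reaction_role=True))
```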

Data Selection

Patent documents can be long and extensive. Annotation of full-text documents can be time-consuming and expensive. Complexity may be reduced by selecting snippets of patent text from a large set of patent documents that represent the diversity of the data according to block 616. For example, all EPO patents with IPC class A or C (corresponding to chemistry) from a 3-month period in 2016 may be downloaded. This may yield 19,274 patents, which are divided into snippets as follows. First, each patent is divided into six snippets containing the title, abstract, claims, description, metadata, and non-English section of the patent. Second, since the performance of the brat toolkit drops on long files, snippets of more than 50 paragraphs are further divided into multiple snippets. From this set of snippets, a small set was selected for annotation at block 618.
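The second division step can be sketched as a simple chunking function. This is a minimal example; the function name `split_snippet` and the representation of a snippet as a list of paragraph strings are assumptions made for illustration.

```python
def split_snippet(paragraphs, max_paragraphs=50):
    """Split a list of paragraphs into consecutive snippets of at most
    `max_paragraphs` paragraphs each, preserving order."""
    return [
        paragraphs[i:i + max_paragraphs]
        for i in range(0, len(paragraphs), max_paragraphs)
    ]


if __name__ == "__main__":
    description = [f"paragraph {n}" for n in range(120)]
    print([len(snippet) for snippet in split_snippet(description)])
```

A section of 50 paragraphs or fewer is returned as a single snippet, so the function can be applied uniformly to every section.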

Random stratified sampling may be performed based on the sub-classes of IPC A and C (list available at https://www.wipo.int/classifications/ipc/en/). In addition, the following conditions were satisfied: 10% of the snippets were from titles, 10% from abstracts, 40% from claims, and 40% from descriptions, and all snippets were from different patents.

A total of 131 snippets were selected, which constitute a patent corpus. The IPC sub-classes that occurred most frequently were A61K, A61B, C07D, A61F, A61M and C12N.

Chemical Entity Annotation Process

In one example, ten (10) chemistry graduates were selected as annotators for annotation according to block 620. The annotators were located in different European countries. To train the annotators, 11 of the 131 patent snippets were distributed among the annotators using the brat annotation tool. The snippets were pre-annotated at block 618 with an untuned version of the chemical entity recognition software that is used in the present disclosure (only for the categories mono-component compound and chemical class). The pre-annotations were displayed in brat, and annotators were asked to modify incorrect pre-annotated entities (wrong boundary or entity type) and add missing entities according to the guideline, as depicted in FIG. 7.

Still referring to FIG. 6, the eleven (11) snippets were also annotated by two subject-matter experts (SMEs) who defined the guidelines. The SMEs had PhDs in chemistry and about 15 years of professional experience in the field. Any discrepancies between the annotations of the two SMEs were resolved in consensus discussions. The resulting annotations (the training corpus) were used as a reference and compared to the annotations of each of the other annotators by inter-annotator agreement (IAA) scores. The F-score (harmonic mean of recall and precision) was used as a measure of IAA. Several review sessions were held to compare annotations and resolve inconsistencies, and the annotation guideline was updated for clarity if needed. For each annotator, training continued until the IAA between the annotator and the SMEs was at least 85%.

After successful completion of the training, the remaining 120 snippets of the corpus were distributed between the annotators. Each snippet was annotated by three annotators, after which the annotations were harmonized at block 622. The harmonization was done for each entity as follows: if at least two annotators agreed on the entity boundaries and the entity type, that annotation was added to the gold-standard set; otherwise an SME adjudicated the disagreement. This resulted in the chemical entity annotation at block 624.
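The harmonization rule can be sketched as a vote count over exact annotations. In this minimal example an annotation is assumed to be a tuple of start offset, end offset, and entity type; that representation and the function name `harmonize` are hypothetical.

```python
from collections import Counter


def harmonize(annotations_per_annotator, min_agreement=2):
    """Split annotations into a gold-standard set (agreed on by at least
    `min_agreement` annotators) and a set left for SME adjudication."""
    counts = Counter()
    for annotations in annotations_per_annotator:
        # Use a set so that one annotator cannot vote twice for the same span.
        counts.update(set(annotations))
    gold = {a for a, n in counts.items() if n >= min_agreement}
    needs_sme = {a for a, n in counts.items() if n < min_agreement}
    return gold, needs_sme


if __name__ == "__main__":
    annotator_1 = [(0, 15, "compound"), (20, 28, "class")]
    annotator_2 = [(0, 15, "compound"), (20, 28, "compound")]
    annotator_3 = [(0, 15, "compound")]
    gold, needs_sme = harmonize([annotator_1, annotator_2, annotator_3])
    print(sorted(gold), sorted(needs_sme))
```

Agreement here requires identical boundaries and identical entity type, so a span tagged with two different types by two annotators is routed to the SME rather than accepted.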

Relevancy Annotation Process

The same training corpus of 11 snippets was also annotated for relevant compounds by the annotators and the SMEs at block 632. They were provided with the reference annotations of the chemical entities and had to indicate whether the annotations were relevant or not. For every snippet, the corresponding full patent text was delivered to the annotators and the SMEs. This allowed them to determine relevance based on the complete document, which included title, abstract, description and claims. The relevancy annotations of the annotators and SMEs were compared, and questions were resolved at blocks 636 and 638.

After training, the 120 snippets of the chemical entity corpus created in the previous step were distributed between the annotators. Each snippet was annotated by five annotators. If more than three annotators annotated the chemical entity as relevant, it was considered relevant. If exactly three annotators annotated the chemical entity as relevant, it was considered equivocal. If fewer than three annotators annotated the chemical entity as relevant, it was considered irrelevant. The equivocal category was introduced since relevance determination is sometimes complex and judged differently by different experts (as relevance is decided based on the full text). To capture this complexity, no attempt was made to resolve ambiguity by enforcing a decision by the SMEs. As per the guidelines developed in block 634, relevance is document based. As a result, if a compound is considered relevant at one occurrence in the snippet, it is automatically marked relevant at any other occurrence. Finally, the annotators were also asked to annotate any spelling errors. This annotation can be helpful for improvement of chemical entity recognition systems. As spelling errors can be hard to detect, each spelling-error annotation was accepted, irrespective of the number of annotators that made that annotation. The corpus was divided into a development set and a test set consisting of 50 and 70 snippets, respectively.
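The voting rule for five annotators can be expressed directly. This is a minimal sketch; the function name `relevance_label` is hypothetical.

```python
def relevance_label(votes_relevant, n_annotators=5):
    """Map the number of 'relevant' votes out of five annotators to a label:
    more than three -> relevant, exactly three -> equivocal, fewer -> irrelevant."""
    if not 0 <= votes_relevant <= n_annotators:
        raise ValueError("vote count out of range")
    if votes_relevant > 3:
        return "relevant"
    if votes_relevant == 3:
        return "equivocal"
    return "irrelevant"


if __name__ == "__main__":
    for votes in range(6):
        print(votes, relevance_label(votes))
```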

Chemical Entity Recognition

The focus was on non-statistical approaches for chemical entity recognition, because a chemical structure was to be associated with each extracted chemical compound. A dictionary-based approach was used in combination with a morphology-based approach to identify chemical entities. The structures of these compounds were produced, validated and standardized using the Reaxys Name Service described herein. Since the gold-standard annotations showed that only a small set of relevant entities are from compound class categories (see results), the scope of chemical entity recognition was reduced to the identification and classification of chemical compounds.

Name Service

The Reaxys system uses a name-to-structure toolkit (Reaxys Name Service) and a set of standardization rules (e.g. eliminate hydrogen bonds when constructing structures) when new compounds are inserted into the database. In the present disclosure, the Name Service was used to convert names to structures and standardize those structures as well as the structures in different dictionaries based on the Reaxys standardization rules, and to validate the structures assigned to chemical compounds.

Chemical Entity Recognizers

An ensemble system was used for chemical entity recognition. First, Elsevier's CER software was used. CER identifies and tags chemical compounds and their physical properties (e.g. color, melting point, and boiling point) within a text document and converts extracted compounds into a chemical structure (e.g., using Name Service). In addition, CER also identifies chemical reactions and chemical properties within the patent document. The software uses a combination of dictionary-based and morphology-based approaches to extract chemical compounds from patents. CER was loaded with a dictionary derived from the manually curated compounds in the Reaxys database. Further, an exclusion list was used to filter out any noise (e.g. frequent compounds such as oxygen) from the extracted compounds. The morphology-based approach in CER identifies different elements within a compound and combines them to create the final compound only if it can validate the compound based on its structural chemistry (e.g. can two elements bind with each other in this manner). This validation is done on the structural level and through a set of pre-defined rules processed by the Name Service. CER cannot assign the extracted compounds to the different compound groups that are defined in the guidelines.

Second, a mining software program (e.g., a modified version of OCMiner) was used to identify chemical entities. OCMiner also uses a dictionary-based approach along with a morphology-based approach to extract chemical compounds. The dictionary used for OCMiner was generated from a compound database built from various publicly available sources such as PubChem, DrugBank, ChEMBL, ChEBI, and/or the like. To improve the quality of the dictionary, frequent chemical identifiers that were associated with more than one structure were manually resolved and the name-to-structure mappings of the most-frequent identifiers were manually validated. OCMiner also used other resolution mechanisms to improve the quality of the dictionary (e.g. counting the number of stereocenters). The Name Service was used to standardize the compounds within these dictionaries based on the same standardization rules applied by CER and Reaxys. In comparison to CER, OCMiner has additional functionality, such as abbreviation expansion and spelling-error correction. The software also has post-dictionary modules to identify systematic names. In a separate module built for the present disclosure, OCMiner cleans up the chemical entities identified by both CER and OCMiner (e.g. overlapping annotations and combination of simple annotations to complex entities) and assigns compounds to the different compound groups. Finally, OCMiner generates a confidence score for all recognized chemical entities extracted by CER or OCMiner.
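The clean-up of overlapping annotations from the two recognizers might look like the following sketch. The present disclosure does not state how overlaps are resolved; keeping the longest span among overlapping candidates is an assumption made purely for illustration, as are the tuple layout (start, end, source) and the function name `resolve_overlaps`.

```python
def resolve_overlaps(spans):
    """Given (start, end, source) spans with half-open offsets, keep the
    longest span among any group of overlapping spans (assumed rule)."""
    chosen = []
    # Visit longer spans first so that they win against shorter overlapping ones.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        start, end = span[0], span[1]
        if all(end <= c[0] or start >= c[1] for c in chosen):
            chosen.append(span)
    return sorted(chosen, key=lambda s: s[0])


if __name__ == "__main__":
    merged = resolve_overlaps([
        (0, 6, "CER"), (0, 14, "OCMiner"),
        (20, 35, "CER"), (30, 35, "OCMiner"),
    ])
    print(merged)
```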

Relevancy Classification

Relevance of a chemical compound is defined based on the context of the full patent document. To identify the relevance of a specific entity, the complete patent document should be analyzed for that entity. Therefore, statistical information was gathered for each unique entity (recognized in the snippet) from the whole patent text, and that information was used to classify the extracted entity. Relevancy classification was expressed as a scalar relevance score that after normalization can vary between zero (irrelevant) and one (relevant). The corpus was divided into a training set and a test set to experimentally find the best threshold for relevancy classification. The training set was used along with the relevance score to define the best cut-off point for the relevancy classification. The results were then tested on the test set.
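Finding the cut-off point on the training set can be sketched as a sweep over candidate thresholds that keeps the one with the highest F-score. The data layout (pairs of relevance score and gold label), the rule that a score at or above the threshold counts as relevant, and the function names are assumptions of this example.

```python
def f_score(tp, fp, fn):
    """F-score from counts; algebraically equal to 2PR/(P+R)."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(scored, thresholds):
    """scored: list of (relevance_score, is_relevant) pairs from the training set.
    Returns (threshold, f_score) for the best-performing candidate threshold."""
    best = (None, -1.0)
    for t in thresholds:
        tp = sum(1 for s, gold in scored if s >= t and gold)
        fp = sum(1 for s, gold in scored if s >= t and not gold)
        fn = sum(1 for s, gold in scored if s < t and gold)
        f = f_score(tp, fp, fn)
        if f > best[1]:  # ties keep the lowest threshold seen first
            best = (t, f)
    return best


if __name__ == "__main__":
    training = [(0.9, True), (0.8, True), (0.6, True),
                (0.55, False), (0.3, False), (0.2, False), (0.1, False)]
    print(best_threshold(training, [0.0, 0.25, 0.5, 0.75]))
```

The selected threshold is then held fixed and applied unchanged to the test set.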

Relevance Score

Several features derived from the full text are used to calculate the relevancy score. The relevancy score is a linear combination of these features, where the coefficients (or weights) are heuristically determined. These features include the following:

-   A. Compound frequency: Frequency of the compound within the patent document. Usually compounds that occur frequently in a patent document are less relevant (due to the nature of patents), unless the compound is unique to the patent.
-   B. Compound section: Occurrence of the compound within specific sections of a patent document (e.g. title and claim). A compound in a claim section is more relevant than a compound in a description section of a patent. If a compound appears in multiple sections, the compound may be prioritized based on which of the sections it appears in, in the following order: Title, Abstract, Claim, and Description.
-   C. Compound length: Length of the extracted term. Longer names may be more likely to be International Union of Pure and Applied Chemistry (IUPAC) names and hence have a higher chance of being relevant.
-   D. Surrounding characters: Occurrence of the compound within special characters (e.g. ‘[’, ‘(’). Examples are usually mentioned between special characters and they will be less relevant.
-   E. Compound section uniqueness: Compound single occurrence within a section of the patent. If a compound is mentioned once in the claims and a few times in the description, the compound has higher probability to be relevant than the other way around.
-   F. Compound without solvent: If the compound does not contain solvents or laboratory chemicals, there is a higher probability of the compound being relevant.
-   G. Compound wide usage: Presence of the compound in one of a number of predefined groups representing the frequency of compounds in a large set of chemistry patent documents. To create the groups, all chemical entities from a large set of patent documents (selection of chemical patents in 2015, excluding patents from the patent corpus) were extracted using OCMiner and ranked according to their frequency of occurrence. The resultant compound list was divided into 16 equally-sized groups (16 being an arbitrary number). Note here that the calculation is extended to data derived from a larger set of patent documents. If a compound is frequently mentioned in other patent documents, then there is a lower probability of it being relevant.
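A linear combination of features A through G can be sketched as below. The weights are placeholders invented for this example, since the heuristically determined coefficients are not given in the present disclosure. The sketch also assumes that each feature has already been scaled to the range zero to one with higher values meaning more relevant, and the numeric values assigned to the section priority order are likewise hypothetical.

```python
# Placeholder weights for features A-G; NOT the coefficients of the disclosed system.
HYPOTHETICAL_WEIGHTS = {
    "compound_frequency": 1.0,      # A
    "compound_section": 2.0,        # B
    "compound_length": 3.0,         # C
    "surrounding_characters": 1.0,  # D
    "section_uniqueness": 1.0,      # E
    "without_solvent": 1.0,         # F
    "wide_usage": 2.0,              # G
}

# Priority order taken from feature B: Title, Abstract, Claim, Description.
SECTION_PRIORITY = ("title", "abstract", "claims", "description")


def section_feature(sections_found):
    """Score the highest-priority section a compound occurs in (assumed scale)."""
    for rank, name in enumerate(SECTION_PRIORITY):
        if name in sections_found:
            return 1.0 - rank / len(SECTION_PRIORITY)
    return 0.0


def relevance_score(features, weights=HYPOTHETICAL_WEIGHTS):
    """Weighted sum of features, divided by the total weight so that the
    result stays between zero and one when every feature does."""
    total = sum(weights.values())
    return sum(w * features[name] for name, w in weights.items()) / total


def is_relevant(features, threshold=0.53):
    """Apply a cut-off; 0.53 is the threshold reported for the training set."""
    return relevance_score(features) >= threshold


if __name__ == "__main__":
    example = dict.fromkeys(HYPOTHETICAL_WEIGHTS, 0.0)
    example["compound_section"] = section_feature({"claims", "description"})
    example["compound_length"] = 1.0
    print(round(relevance_score(example), 3), is_relevant(example))
```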

It should be understood that the above-mentioned features may later be used by a machine learning algorithm, such as, for example, a machine learning algorithm contained within the chemical entity recognition system 120, to determine whether a particular chemical entity is relevant to the patent document from which the chemical entity was extracted.

Performance Evaluation

The performance of the system against the gold-standard annotations was evaluated using recall, precision and F-score, given the number of true positives (TP), false positives (FP), and false negatives (FN). For the entity recognition task, TP represents the total number of chemical entities correctly identified by the system (based on the starting and ending position of the entity in text), FP represents the number of entities wrongly identified by the system, and FN represents the number of entities that are missed by the system. Recall, precision and F-score metrics are calculated as follows: recall = TP/(TP + FN), precision = TP/(TP + FP) and F-score = 2 × precision × recall/(precision + recall).

For the relevancy classification task, TP, FP and FN are determined at the document level and only take into account the unique entities identified in each of the documents. TP represents the number of compounds correctly classified as relevant, FP represents the number of compounds wrongly classified as relevant by the system, and FN represents the number of relevant compounds missed by the system. The compounds in the corpus that were annotated as equivocal were disregarded in the relevancy calculation. This choice was made because human annotators evidently could not agree on the relevance of those compounds.
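The document-level counting with equivocal compounds left out can be sketched as follows. The dictionary of gold labels keyed by unique entity and the function names are assumptions made for this example; the metric formulas are those given above.

```python
def relevancy_counts(gold_labels, predicted_relevant):
    """Count TP, FP and FN over the unique entities of one document.
    gold_labels maps entity -> 'relevant' | 'irrelevant' | 'equivocal';
    predicted_relevant is the set of entities the system labeled relevant."""
    tp = fp = fn = 0
    for entity, label in gold_labels.items():
        if label == "equivocal":
            continue  # disregarded in the relevancy calculation
        predicted = entity in predicted_relevant
        if predicted and label == "relevant":
            tp += 1
        elif predicted:
            fp += 1
        elif label == "relevant":
            fn += 1
    return tp, fp, fn


def precision_recall_f(tp, fp, fn):
    """precision = TP/(TP+FP), recall = TP/(TP+FN), F = 2PR/(P+R)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f


if __name__ == "__main__":
    gold = {"a": "relevant", "b": "relevant", "c": "irrelevant",
            "d": "equivocal", "e": "irrelevant"}
    counts = relevancy_counts(gold, {"a", "c", "d"})
    print(counts, precision_recall_f(*counts))
```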

Results

Chemical Entity Annotation

The average IAA between the annotators on the 11 training documents initially was 72% and reached 92% after two rounds of training. On the gold-standard set of 120 snippets, the average IAA between the annotators and the harmonized annotations was 87%. This was higher than the IAA between the pre-annotation and the gold standard (77% for mono-component compound and 23% for chemical class), indicating that annotators considerably changed the pre-annotations. Table 1 below provides the frequency of entities within the corpus. Overall, 18,789 chemical entities were annotated, of which 15,199 were chemical compounds and 3,590 were chemical classes. This resulted in an average of around 150 annotations per snippet. The majority of the annotations consisted of mono-component compounds (13,564). In addition, the corpus contains 1848 relationships from chemical compounds or classes to 628 suffix or prefix annotations (a suffix or prefix can have a relationship with one or more chemical compounds or classes).

Relevancy Annotation

All 18,789 chemical entities were annotated for relevance, as shown in Table 1 below. Of the 15,199 compounds, 1509 (9.9%) were considered relevant and 362 (2.4%) were equivocal. Of the 3590 chemical classes, 266 (7.4%) were relevant, while 30 (0.8%) were equivocal. Thus, the majority of entities were considered irrelevant (87.7% of the compounds and 91.8% of the classes).

TABLE 1
Number of Annotations in the Gold-Standard Set

Annotation type          Annotation subtype  Annotations  Relevant  Equivocal  Irrelevant
Compounds                Mono Component           13,564       883        362      12,319
                         Mixture part               1010         0          0        1010
                         Prophetic                   625       625          0           0
Classes                  Chemical Class             1848       249         30        1569
                         Biomolecule                1039         0          0        1039
                         Markush                      17        17          0           0
                         Mixture                     286         0          0         286
                         Mixture Part                174         0          0         174
                         Polymer                     226         0          0         226
Total Chemical Entities                           18,789      1774        392      16,623
Other                    Suffix and Prefix           628         -          -           -
                         Relation                   1848         -          -           -

TABLE 2
Performance of the chemical entity recognition system on compound recognition for different confidence score thresholds

Confidence Score   Development                    Test
Threshold          Precision  Recall  F-Score     Precision  Recall  F-Score
0.0                     88.5    79.3     83.6          86.5    82.3     84.3
0.1                     88.6    79.1     83.6          89.1    82.3     85.6
0.2                     89.1    78.9     83.7          90.1    82.3     86.2
0.3                     89.1    78.6     83.5          90.1    81.6     85.7
0.4                     89.1    78.4     83.4          90.1    81.5     85.6
0.5                     89.1    78.4     83.4          90.1    81.5     85.6
0.6                     89.1    78.4     83.4          90.1    81.3     85.5
0.7                     87.2    60.6     71.5          90.7    69.4     78.6
0.8                     82.0    36.2     50.3          96.2    39.8     56.3
0.9                    100.0     0.1      0.2          96.4     0.8      1.7
1.0                    100.0     0.1      0.2          97.2     0.8      1.7

Chemical Entity Recognition

The performance of the chemical entity recognition system on compound recognition is shown in Table 2 above for different thresholds of the confidence score. On the development set, a threshold of 0.2 yielded the best F-score of 83.7% (precision, 89.1%, and recall, 78.9%). For this threshold, the best result was also obtained on the test set (F-score, 86.2%; precision, 90.1%; and recall, 82.3%). Error analysis of the results indicated that the performance of the system may further be improved by better recognizing prophetic compounds, reactants, and products of synthesis procedures.

Relevancy Classification

FIG. 8 depicts the performance of the chemical entity recognition system for different relevance score thresholds on the training set. The best performance (in terms of F-score) was obtained for a relevance score threshold of 0.53, with a precision of 85%, a recall of 87% and an F-score of 86%. For the same threshold, the performance on the test set was slightly lower with 81% precision and 82% recall, resulting in an F-score of 82%. Further investigation into the compounds that the system classified as relevant showed that 97% of these compounds were annotated as chemical compounds in the chemical entity corpus. Therefore, only 3% of the compounds classified by the system as relevant were not chemical entities.

The relevancy classification is dependent on the performance of the chemical entity recognition system in two ways. First, only compounds that are found by the CER can be classified as relevant. Second, the relevance-score features for a given chemical entity are based on the full patent text, so the recognizer needs to correctly identify all occurrences of that entity in the full text. To assess the effect of the first dependency on the performance of the relevance system, the gold-standard chemical entities were fed as input to the chemical entity recognition system (simulating a scenario where the chemical entity recognition system has a precision and recall of 100%). Apart from the patent snippet, all other parts of the full patent document were analyzed with the original system because gold-standard annotations were not available. When evaluated on the test set, the relevance classification system obtained 93% precision, 88% recall and 91% F-score. Further investigation into these scores indicated that the system could have performed better if the second dependency were also eliminated.

The contribution of individual relevancy features to the performance of the chemical entity classification system was investigated. For this, each feature was removed in turn from the relevance score and the relevance score threshold was adjusted for optimal performance. Table 3 below shows that the length of the compound is a major indicator of the relevance of the compound (10 percentage points added value). Additionally, the patent section in which the compound was found and compound wide usage in other publications are also good indicators of the relevance of the compound (around 5 percentage points added value). The other features contribute between 1 and 2 percentage points to the relevancy classification performance.

As can be seen from Table 3 below, leaving out a feature can affect the optimal value of the relevance-score threshold. FIG. 9 shows the performance of the chemical entity classification system as a function of the threshold value when a feature is left out.

TABLE 3
The added value of individual features based on “leave-one-out” methodology

Setting                          Threshold  Precision  Recall  F-Score  Added Value
All features                          0.53       84.8    86.8     85.8            -
A-Compound Frequency                  0.47       82.8    86.2     84.5          1.3
B-Compound Section                    0.40       95.5     0.0     80.8          5.0
C-Compound Length                     0.40       75.9    75.5     75.7         10.1
D-Surrounding Characters              0.53       85.1    82.9     84.0          1.8
E-Compound Section Uniqueness         0.53       84.8    82.9     83.9          1.9
F-Compound Without Solvent            0.53       85.1    82.9     84.0          1.8
G-Compound Wide Usage                 0.53       83.9    76.4     80.0          5.8
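The added value in Table 3 is the F-score obtained with all features minus the F-score obtained when one feature is left out. The short sketch below recomputes that column from the F-scores listed in Table 3; the variable and function names are hypothetical.

```python
# F-scores copied from Table 3 (percent).
F_ALL_FEATURES = 85.8
F_WITHOUT_FEATURE = {
    "A-Compound Frequency": 84.5,
    "B-Compound Section": 80.8,
    "C-Compound Length": 75.7,
    "D-Surrounding Characters": 84.0,
    "E-Compound Section Uniqueness": 83.9,
    "F-Compound Without Solvent": 84.0,
    "G-Compound Wide Usage": 80.0,
}


def added_value(f_all, f_without):
    """Percentage points of F-score lost when each feature is removed."""
    return {name: round(f_all - f, 1) for name, f in f_without.items()}


if __name__ == "__main__":
    values = added_value(F_ALL_FEATURES, F_WITHOUT_FEATURE)
    for name, value in sorted(values.items(), key=lambda kv: -kv[1]):
        print(f"{name}: {value}")
```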

DISCUSSION

Relevance of a chemical compound is based on the context of the full patent document. Generally, a relevant compound is a compound that plays a major role in the patent document (e.g. a product of a reaction that is mentioned in the Claims section of a patent document). The present disclosure shows that these compounds are a small subset (<10%) of all compounds mentioned in the textual part of a patent document.

The present disclosure presents a two-step approach to identify relevant compounds in patent documents: compound identification (first step) followed by compound classification (second step). This approach allows the use of the output of the first step for additional purposes (such as indexing chemical compounds mentioned in patent documents) but at the same time introduces dependencies. Obtaining high precision and recall values in the first step is essential for the success of the second step. An ensemble approach combining dictionary-based and morphology-based approaches was used to obtain high precision and recall. These approaches require only a small annotated corpus and can provide a structural representation of the extracted compounds. Associating correct chemical structures with compounds is essential when extracting chemical compounds. To reduce the possibility of associating a compound with the wrong structure, the structures of compounds in the different databases were regenerated using the name-to-structure toolkit (Name Service) and standardized based on the standardization rules used for Reaxys.

The structures of non-systematic identifiers associated with a compound within Reaxys are manually drawn by excerpters and are later validated and standardized using Name Service. Adding such structures to the Name Service database allowed the generation of structures for non-systematic identifiers. The same toolkit with the same standardization functionalities was used to validate compounds extracted using the grammar-based approach. This ensures high quality and consistency of the extracted compounds.

To build the chemical entity recognition system, a patent corpus annotated with chemical entities and their relevance was needed. Currently available patent corpora either are limited to subsections of the patent documents, mostly title and abstract, or have other limitations that prevent their use, such as different guideline definitions (focus on different entity types), harmonization approaches (manual using SMEs vs automation), low or unidentified IAA scores and limited scope of coverage (only one chemical IPC class or one section of a document). The corpus was developed in two steps. First, a chemical entity corpus was constructed using random stratified sampling for content selection and manual harmonization to ensure high quality. Later, this corpus was extended with relevancy annotations. The inherent difficulty of classifying the relevance of some compounds was taken into account in the corpus by introducing ‘equivocal’ as a classification. Chemical compounds identified as equivocal can be classified as either relevant or irrelevant, and the system may assign either label to such compounds. Any compound identified as equivocal was disregarded in the evaluation. Using five annotators for relevancy annotation, the equivocal label is limited to about 2% of the compounds.

Normalized patent documents were used to develop the corpus and the system. Any change in the normalization approach will lead to changes to the corpus and might result in a need for retraining the system. This dependency was reduced by finalizing the normalization before developing the corpus and the software. A one-to-one mapping between the original patent document and the normalized patent document was also introduced to allow possible changes to the corpus with limited effort. The chemical entity recognition system has a lower dependency on the normalization step, as its performance is calculated on unique mentions of compounds within a patent. The dependency on the normalization step is determined by the quality of the patent source file. Digital patent documents (e.g. from EPO or USPTO) have a higher quality than OCR patent documents (e.g. from WIPO). Therefore, the system is more dependent on the normalization when dealing with OCR patents.

The chemical entity recognition system showed a precision of 90.1% and a recall of 82.3% for compound recognition on EPO patents. The state-of-the-art statistical systems (tested on patent title and abstract) have obtained higher recall (precision of 87.5% and recall of 91.3%). These systems do not generate structures for the identified chemical compounds. Error analysis of the system disclosed herein indicated that the loss in recall is mainly due to the fact that reactants and products of synthesis procedures are not recognized, and prophetic compounds are missed. Identification of prophetic compounds may be improved by taking into account trigger phrases (e.g. ‘The compound of claim is:’, ‘A compound selected from’) or negative triggers for these compounds (e.g. ‘catalysts’).

It should now be understood that systems, methods, and computer-readable media described herein automatically extract chemical compounds from a patent document and determine the chemical compound's relevance to that patent document. The systems, methods, and computer-readable media described herein include a training device that is particularly configured to pull patent documents from a database, normalize the patent documents, and feed the patent documents to a chemical entity recognition system such that the chemical entity recognition system, once trained, can automatically recognize chemical compounds within the normalized patent documents and determine whether the chemical compounds are relevant or irrelevant to the associated patent documents.

While particular embodiments have been illustrated and described herein, it should be understood that various other changes and modifications may be made without departing from the spirit and scope of the claimed subject matter. Moreover, although various aspects of the claimed subject matter have been described herein, such aspects need not be utilized in combination. It is therefore intended that the appended claims cover all such changes and modifications that are within the scope of the claimed subject matter.

1. A method of training a chemical entity recognition system to extract one or more chemical compounds from a patent document and determine a relevance of the one or more chemical compounds to the patent document, the method comprising: obtaining, by a processing device, a plurality of patent documents from one or more patent databases; normalizing, by the processing device, each patent document of the plurality of patent documents into a unified format to achieve a plurality of unified patent documents; generating, by the processing device, a chemical patent corpus from the plurality of unified patent documents, the chemical patent corpus comprising one or more chemical entities extracted from the plurality of unified patent documents, each of the one or more chemical entities comprising one or more relevancy annotations, the one or more relevancy annotations indicating a relevance to the patent document from which the chemical entity is extracted; and providing, by the processing device, the chemical patent corpus to the chemical entity recognition system, wherein the chemical entity recognition system, in response to receiving the chemical patent corpus, tags the one or more chemical entities in a corresponding normalized patent document of the plurality of unified patent documents, extracts one or more additional chemical entities from the plurality of unified patent documents, assigns a confidence score to each of the one or more additional chemical entities, and labels each of the one or more additional chemical entities as relevant or irrelevant to an associated patent document based on information contained in the chemical patent corpus.
2. The method of claim 1, wherein obtaining the plurality of patent documents from the one or more patent databases comprises obtaining patent documents that are classified as chemistry related patent documents.
3. The method of claim 1, wherein normalizing each patent document of the plurality of patent documents comprises converting the plurality of patent documents into a unified XML representation format, utilizing one or more predefined XML tags corresponding to heuristic information within the plurality of patent documents, and storing one-to-one mapping between each character in an original text of each patent document and a corresponding character in a normalized patent document.
4. The method of claim 1, wherein generating the chemical patent corpus comprises: identifying a chemical compound within text contained in each patent document of the plurality of unified patent documents; accessing a physical properties database and obtaining one or more physical properties of the identified chemical compound; and generating a chemical structure corresponding to the chemical compound based on the one or more physical properties.
5. The method of claim 4, wherein identifying the chemical compound comprises utilizing one or more of a dictionary-based approach and a morphology-based approach to identify the chemical compound, wherein the morphology-based approach comprises identifying one or more elements within the chemical compound and combining the one or more elements to create the chemical compound if the chemical compound is validated based on a structural chemistry of the chemical compound.
6. The method of claim 1, wherein generating the chemical patent corpus from the plurality of unified patent documents comprises annotating each of the plurality of unified patent documents with one or more of a chemical compound, a compound class, a suffix of a chemical compound, and a prefix of a chemical compound.
7. The method of claim 6, wherein the chemical compound is selected from a mono-component compound, a compound mixture part, or a prophetic compound.
8. The method of claim 6, wherein the compound class is selected from a chemical class, a biomolecule, a polymer, a mixture class, a mixture part class, or a Markush class.
9. The method of claim 1, wherein the one or more relevancy annotations comprise: a relevant compound indicated for a prophetic compound or a Markush class; and an irrelevant compound indicated for a compound mixture part, a mixture part class, a mixture class, a polymer, or a biomolecule.
10. The method of claim 1, wherein the one or more relevancy annotations for a mono-component compound or a chemical class are assigned based on a context of the corresponding unified patent document.
11. The method of claim 1, wherein the confidence score is calculated based on one or more of a frequency of a compound in a patent document, an occurrence of a compound within predefined sections of a patent document, a length of a term, an occurrence of a compound within special characters, an occurrence of a single compound within a section of a patent document, a compound not containing solvents or laboratory chemicals, and a presence of a compound in one or more predefined groups representing a frequency of compounds in a large set of chemistry patent documents.
12. A system configured for training a chemical entity recognition system to extract one or more chemical compounds from a patent document and determine a relevance of the one or more chemical compounds to the patent document, the system comprising: one or more hardware processors; and a non-transitory, processor-readable storage medium comprising one or more programming instructions thereon that, when executed, cause the one or more hardware processors to: obtain a plurality of patent documents from one or more patent databases, normalize each patent document of the plurality of patent documents into a unified format to achieve a plurality of unified patent documents, generate a chemical patent corpus from the plurality of unified patent documents, the chemical patent corpus comprising one or more chemical entities extracted from the plurality of unified patent documents, each of the one or more chemical entities comprising one or more relevancy annotations, the one or more relevancy annotations indicating a relevance to the patent document from which the chemical entity is extracted, and provide the chemical patent corpus to the chemical entity recognition system, wherein the chemical entity recognition system tags the one or more chemical entities in a corresponding normalized patent document of the plurality of unified patent documents, extracts one or more additional chemical entities from the plurality of unified patent documents, assigns a confidence score to each of the one or more additional chemical entities, and labels each of the one or more additional chemical entities as relevant or irrelevant to an associated patent document based on information contained in the chemical patent corpus.
13. The system of claim 12, wherein the programming instructions that cause the one or more hardware processors to normalize each patent document of the plurality of patent documents comprise programming instructions that, when executed, cause the one or more hardware processors to convert the plurality of patent documents into a unified XML representation format, utilize one or more predefined XML tags corresponding to heuristic information within the plurality of patent documents, and store one-to-one mapping between each character in an original text of each patent document and a corresponding character in a normalized patent document.
14. The system of claim 12, wherein the programming instructions that cause the one or more hardware processors to generate the chemical patent corpus comprise programming instructions that, when executed, cause the one or more hardware processors to: identify a chemical compound within text contained in each patent document of the plurality of unified patent documents; access a physical properties database and obtain one or more physical properties of the identified chemical compound; and generate a chemical structure corresponding to the chemical compound based on the one or more physical properties.
15. The system of claim 12, wherein the one or more relevancy annotations comprise: a relevant compound indicated for a prophetic compound or a Markush class; and an irrelevant compound indicated for a compound mixture part, a mixture part class, a mixture class, a polymer, or a biomolecule.
16. The system of claim 12, wherein the one or more relevancy annotations for a mono-component compound or a chemical class are assigned based on a context of the corresponding unified patent document.
17. The system of claim 12, wherein the confidence score is calculated based on one or more of a frequency of a compound in a patent document, an occurrence of a compound within predefined sections of a patent document, a length of a term, an occurrence of a compound within special characters, an occurrence of a single compound within a section of a patent document, a compound not containing solvents or laboratory chemicals, and a presence of a compound in one or more predefined groups representing a frequency of compounds in a large set of chemistry patent documents.
18. A non-transitory storage medium having executable instructions embodied thereon for causing a processing device to: obtain a plurality of patent documents from one or more patent databases; normalize each patent document of the plurality of patent documents into a unified format to achieve a plurality of unified patent documents; generate a chemical patent corpus from the plurality of unified patent documents, the chemical patent corpus comprising one or more chemical entities extracted from the plurality of unified patent documents, each of the one or more chemical entities comprising one or more relevancy annotations, the one or more relevancy annotations indicating a relevance to the patent document from which the chemical entity is extracted; and provide the chemical patent corpus to the chemical entity recognition system, wherein the chemical entity recognition system tags the one or more chemical entities in a corresponding normalized patent document of the plurality of unified patent documents, extracts one or more additional chemical entities from the plurality of unified patent documents, assigns a confidence score to each of the one or more additional chemical entities, and labels each of the one or more additional chemical entities as relevant or irrelevant to an associated patent document based on information contained in the chemical patent corpus.
19. The non-transitory storage medium of claim 18, wherein the executable instructions for causing the processing device to normalize each patent document of the plurality of patent documents comprise executable instructions for causing the processing device to convert the plurality of patent documents into a unified XML representation format, utilize one or more predefined XML tags corresponding to heuristic information within the plurality of patent documents, and store one-to-one mapping between each character in an original text of each patent document and a corresponding character in a normalized patent document.
20. The non-transitory storage medium of claim 18, wherein the executable instructions for causing the processing device to generate the chemical patent corpus comprise executable instructions for causing the processing device to: identify a chemical compound within text contained in each patent document of the plurality of unified patent documents; access a physical properties database and obtain one or more physical properties of the identified chemical compound; and generate a chemical structure corresponding to the chemical compound based on the one or more physical properties.